AITopics

Country:

Oceania (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Knowledge Management > Knowledge Engineering (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.93)
(4 more...)

Neural Information Processing Systems - Feb-8-2026, 11:14:28 GMT

Language-Augmented Visual Models

Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, it remains challenging to evaluate the transferability of these models due to the lack of easy-to-use evaluation toolkits and public benchmarks. To tackle this, we build ELEVATER, the first benchmark and toolkit for evaluating (pre-trained) language-augmented visual models. ELEVATER is composed of three components.

artificial intelligence, machine learning, natural language, (19 more...)

Country: Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing Systems - Aug-15-2025, 08:32:38 GMT

K-LITE: Learning Transferable Visual Models with External Knowledge

Our code is available at https://github.com/microsoft/klite.

arxiv preprint arxiv, dataset, knowledge, (14 more...)

Country:

Oceania (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Knowledge Management > Knowledge Engineering (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.93)
(4 more...)

Chapin, Alexandre, Machado, Bruno, Dellandrea, Emmanuel, Chen, Liming

Object-Centric Representations Improve Policy Generalization in Robot Manipulation

arXiv.org Artificial Intelligence - May-20-2025

Visual representations are central to the learning and generalization capabilities of robotic manipulation policies. While existing methods rely on global or dense features, such representations often entangle task-relevant and irrelevant scene information, limiting robustness under distribution shifts. In this work, we investigate object-centric representations (OCR) as a structured alternative that segments visual input into a finite set of entities, introducing inductive biases that align more naturally with manipulation tasks. We benchmark a range of visual encoders (object-centric, global, and dense methods) across a suite of simulated and real-world manipulation tasks ranging from simple to complex, and evaluate their generalization under diverse visual conditions, including changes in lighting, texture, and the presence of distractors. Our findings reveal that OCR-based policies outperform dense and global representations in generalization settings, even without task-specific pretraining. These insights suggest that OCR is a promising direction for designing visual systems that generalize effectively in dynamic, real-world robotic environments.

artificial intelligence, machine learning, representation, (19 more...)

2505.11563

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

arXiv.org Artificial Intelligence - Oct-11-2024

Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment

Chen, Huayu, Su, Hang, Sun, Peize, Zhu, Jun

Classifier-Free Guidance (CFG) is a critical technique for enhancing the sample quality of visual generative models. However, in autoregressive (AR) multi-modal generation, CFG introduces design inconsistencies between language and visual content, contradicting the design philosophy of unifying different modalities for visual AR. Motivated by language model alignment methods, we propose Condition Contrastive Alignment (CCA) to facilitate guidance-free AR visual generation with high performance and analyze its theoretical connection with guided sampling methods. Unlike guidance methods that alter the sampling process to achieve the ideal sampling distribution, CCA directly fine-tunes pretrained models to fit the same distribution target. Experimental results show that CCA can significantly enhance the guidance-free performance of all tested models with just one epoch of fine-tuning (about 1% of pretraining epochs) on the pretraining dataset, on par with guided sampling methods. This largely removes the need for guided sampling in AR visual generation and cuts the sampling cost by half. Moreover, by adjusting training parameters, CCA can achieve trade-offs between sample diversity and fidelity similar to CFG. This experimentally confirms the strong theoretical connection between language-targeted alignment and visual-targeted guidance methods, unifying two previously independent research fields. Code and model weights: https://github.com/thu-ml/CCA.
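For readers unfamiliar with guided sampling, the following is a minimal sketch of how classifier-free guidance is commonly applied to next-token logits. It is not taken from the CCA code base: the function names, the toy logits, and the choice of the `uncond + scale * (cond - uncond)` convention are assumptions made for illustration. It shows why guidance needs two forward passes per step (one conditional, one unconditional), which is the cost that a guidance-free model avoids.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cfg_logits(cond_logits, uncond_logits, scale):
    """Combine conditional and unconditional logits.

    Assumed convention: scale == 1.0 recovers the conditional logits,
    scale > 1.0 pushes the distribution further toward the condition.
    """
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

# Toy logits standing in for two forward passes of the same model
# (hypothetical values, one with the condition and one without).
cond = [2.0, 1.0, 0.0]
uncond = [1.0, 1.0, 1.0]

guided = cfg_logits(cond, uncond, scale=3.0)
probs_plain = softmax(cond)     # guidance-free sampling distribution
probs_guided = softmax(guided)  # sharper distribution after guidance
```

With guidance, every sampled token requires both `cond` and `uncond` to be computed; a model fine-tuned to produce the guided distribution directly would need only one pass.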

arxiv preprint arxiv, large language model, machine learning, (17 more...)

2410.09347

Country: Asia > China > Hong Kong (0.04)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Vision (0.96)
(2 more...)

Castro, Diego, Eloy, Christophe, Ruffier, Franck

Visual collective behaviors on spherical robots

arXiv.org Artificial Intelligence - Oct-6-2024

Implementations of collective motion traditionally disregard the limited sensing capabilities of an individual and instead assume an omniscient perception of the environment. This study implements a visual flocking model in a "robot-in-the-loop" approach to reproduce these behaviors with a flock composed of 10 independent spherical robots. The model achieves robotic collective motion using only the panoramic visual information of each robot, such as the retinal position, optical size, and optic flow of the neighboring robots. We introduce a virtual anchor to confine the collective robotic movements so as to avoid wall interactions. For the first time, a simple visual robot-in-the-loop approach succeeds in reproducing several collective motion phases, in particular swarming and milling. Another milestone achieved by this model is bridging the gap between simulation and physical experiments by demonstrating nearly identical behaviors in both environments with the same visual model. To conclude, we show that our minimal visual collective motion model is sufficient to recreate most collective behaviors on a robot-in-the-loop system that is scalable, behaves as numerical simulations predict, and is easily comparable to traditional models.

information, robot, visual collective behavior, (14 more...)

2409.20539

Country: Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (0.64)

arXiv.org Artificial Intelligence - Jul-30-2024

AxiomVision: Accuracy-Guaranteed Adaptive Visual Model Selection for Perspective-Aware Video Analytics

Dai, Xiangxiang, Zhang, Zeyu, Yang, Peng, Xu, Yuedong, Liu, Xutong, Lui, John C. S.

The rapid evolution of multimedia and computer vision technologies requires adaptive visual model deployment strategies to effectively handle diverse tasks and varying environments. This work introduces AxiomVision, a novel framework that can guarantee accuracy by leveraging edge computing to dynamically select the most efficient visual models for video analytics under diverse scenarios. Utilizing a tiered edge-cloud architecture, AxiomVision enables the deployment of a broad spectrum of visual models, from lightweight to complex DNNs, that can be tailored to specific scenarios while considering camera source impacts. In addition, AxiomVision provides three core innovations: (1) a dynamic visual model selection mechanism utilizing continual online learning, (2) an online method that efficiently accounts for the influence of the camera's perspective, and (3) a topology-driven grouping approach that accelerates the model selection process. With rigorous theoretical guarantees, these advancements provide a scalable and effective solution for visual tasks inherent to multimedia systems, such as object detection, classification, and counting. Empirically, AxiomVision achieves a 25.7% improvement in accuracy.

accuracy, axiomvision, visual model, (13 more...)

2407.20124

Country:

Oceania > Australia > Victoria > Melbourne (0.05)
Asia > China > Hong Kong (0.04)
Asia > China > Hubei Province > Wuhan (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry:

Information Technology (0.67)
Education > Educational Setting (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Ryumina, Elena, Markitantov, Maxim, Ryumin, Dmitry, Kaya, Heysem, Karpov, Alexey

Audio-Visual Compound Expression Recognition Method based on Late Modality Fusion and Rule-based Decision

arXiv.org Artificial Intelligence - Mar-29-2024

This paper presents the results of the SUN team for the Compound Expressions Recognition Challenge of the 6th ABAW Competition. We propose a novel audio-visual method for compound expression recognition. Our method relies on emotion recognition models that fuse modalities at the emotion probability level, while decisions regarding the prediction of compound expressions are based on predefined rules. Notably, our method does not use any training data specific to the target task. Thus, the problem is a zero-shot classification task. The method is evaluated in multi-corpus training and cross-corpus validation setups. Our proposed method achieves an F1-score of 22.01% on the C-EXPR-DB test subset. Our findings from the challenge demonstrate that the proposed method can potentially form a basis for developing intelligent tools for annotating audio-visual data in the context of basic and compound human emotions.
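The two-stage idea described above (fusing modalities at the probability level, then applying predefined rules) can be sketched as follows. This is an illustrative assumption, not the SUN team's implementation: the weighted-average fusion, the "sum of the two constituent emotions" rule, the subset of compound classes, and all probability values are hypothetical.

```python
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

# Hypothetical rule table: each compound expression is tied to two basic emotions.
COMPOUND_RULES = {
    "fearfully surprised": ("fear", "surprise"),
    "happily surprised": ("happiness", "surprise"),
    "sadly angry": ("sadness", "anger"),
}

def fuse(audio_probs, video_probs, audio_weight=0.5):
    """Late fusion: weighted average of per-emotion probabilities (assumed scheme)."""
    return {
        e: audio_weight * audio_probs[e] + (1.0 - audio_weight) * video_probs[e]
        for e in EMOTIONS
    }

def predict_compound(fused_probs):
    """Rule-based decision: score each compound class by the sum of its parts."""
    scores = {
        name: fused_probs[a] + fused_probs[b]
        for name, (a, b) in COMPOUND_RULES.items()
    }
    return max(scores, key=scores.get)

# Toy outputs of two basic-emotion models (made-up numbers, each sums to 1).
audio = {"anger": 0.10, "disgust": 0.05, "fear": 0.40,
         "happiness": 0.05, "sadness": 0.10, "surprise": 0.30}
video = {"anger": 0.05, "disgust": 0.05, "fear": 0.30,
         "happiness": 0.10, "sadness": 0.10, "surprise": 0.40}

fused = fuse(audio, video)
prediction = predict_compound(fused)
```

Because the compound label is derived from basic-emotion probabilities by rule, no compound-expression training data is needed, which is what makes the setting zero-shot.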

dimitrio kollia, emotion, recognition, (15 more...)

2403.12687

Country:

Europe > Russia > Northwestern Federal District > Leningrad Oblast > Saint Petersburg (0.05)
Asia > Russia (0.05)
Europe > Netherlands (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision > Face Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Neural Information Processing Systems - Mar-13-2024, 17:37:18 GMT

7cce53cf90577442771720a370c3c723-Reviews.html

Paper 1048 proposes a system for large-scale zero-shot visual recognition. It consists of the following steps: (1) Learn an embedding of a large number of words in a Euclidean space. The 1,000 categories are a subset of the 'large number of words' of step (1). Replace it by the word embeddings and add a layer to map the core visual model output to the word embeddings. On the positive side: The paper is well written and reads easily.
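The excerpt above omits some steps, so the following is only a rough sketch of the general idea it describes: project a visual model's output into a word-embedding space and label the image with the nearest word, which may be a category never seen during visual training. The two-dimensional embeddings, the projection matrix, and the use of cosine similarity are all hypothetical choices for illustration, not details confirmed by the review.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def project(features, weights):
    """Linear layer mapping visual features into the word-embedding space."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

# Toy word embeddings; "truck" stands in for a label unseen in visual training.
WORD_EMBEDDINGS = {
    "cat": [1.0, 0.0],
    "dog": [0.8, 0.6],
    "truck": [0.0, 1.0],
}

# Hypothetical learned projection from 3-d visual features to 2-d embeddings.
PROJECTION = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

def zero_shot_label(features):
    """Return the word whose embedding is closest to the projected features."""
    embedded = project(features, PROJECTION)
    return max(WORD_EMBEDDINGS, key=lambda w: cosine(embedded, WORD_EMBEDDINGS[w]))

label_a = zero_shot_label([0.1, 0.9, 0.3])
label_b = zero_shot_label([0.9, 0.1, 0.0])
```

The label set is defined entirely by the word-embedding table, so adding a new category only requires adding its word vector, not retraining the visual model.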

justification, recognition, visual model, (6 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.60)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.58)